Skip to content

[mypyc] Add librt.strings.isidentifier codepoint primitive#21522

Open
VaggelisD wants to merge 4 commits into
python:masterfrom
VaggelisD:librt-strings-isidentifier
Open

[mypyc] Add librt.strings.isidentifier codepoint primitive#21522
VaggelisD wants to merge 4 commits into
python:masterfrom
VaggelisD:librt-strings-isidentifier

Conversation

@VaggelisD
Copy link
Copy Markdown
Contributor

5th PR of #21418

True if a codepoint can start a valid identifier (XID_Start, per
PEP 3131). The ASCII fast path covers `[A-Za-z_]` inline; non-ASCII
codepoints round-trip through PyUnicode_FromOrdinal +
PyUnicode_IsIdentifier so the answer matches str.isidentifier on a
1-character string.

The non-ASCII path is the first allocating helper in this series, so
its body lives out-of-line in codepoint_extra_ops.c (it would otherwise
be emitted as a separate copy in every translation unit that includes
the header). On OOM it swallows the exception via PyErr_Clear() and
returns False, which keeps the function ERR_NEVER. Documented at the
call site so callers don't get a surprising silent failure.

Stack: depends on the librt.strings.isspace primitive.
@VaggelisD VaggelisD force-pushed the librt-strings-isidentifier branch from add1d44 to 283b2f8 Compare May 21, 2026 07:42
@github-actions

This comment has been minimized.

Comment thread mypyc/lib-rt/codepoint_extra_ops.c Outdated
Comment thread mypyc/build.py Outdated
Comment on lines +57 to +62
ModDesc(
"librt.strings",
["strings/librt_strings.c", "codepoint_extra_ops.c"],
["codepoint_extra_ops.h"],
["strings"],
),
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for not catching this earlier but it looks like the files are dependent on each other the wrong way. the _extra_ops files should be internal to mypyc and shouldn't be needed when building librt modules.

so instead of including codepoint_extra_ops in librt_strings we need it the other way around, with the common functionality in librt_strings. like it's done with the other _extra_ops files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries, that's on me first!

Copy link
Copy Markdown
Collaborator

@p-sawicki p-sawicki May 22, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! i see there's still #include "codepoint_extra_ops.h" in librt_strings.c and it would be good to clean that up by moving the functions from there to librt_strings.h.

that would make the codepoint_extra_ops files empty so maybe they aren't actually needed and we could just compile the functions statically straight from librt_strings.h? unless you're planning on adding functions later that would be better placed outside of librt_strings.h.

Copy link
Copy Markdown
Contributor Author

@VaggelisD VaggelisD May 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was planning on using it to add functions like toupper(), tolower() etc so that they're separated from librt_strings, what do you think? Worth the complexity?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would there be separate implementations used by the python wrapper in librt_strings and when compiled statically with mypyc? in that case sure, i think it'd make sense to have them in codepoint_extra_ops.

but if they're more similar to the functions added so far where the python wrapper calls the same function that mypyc could generate in the C file then they would have to be in librt_strings. it would be cleaner to keep the separation between the librt modules and C files relevant only in mypyc generated code by not including them by the librt module files.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, they'd be the same function afaict. Do we worry about the librt capsule indirection overhead? I assume these codepoint primitives will participate in hot loops so preferably they'd be inlined instead of going through the librt fp array (?)

I see the overall point though, there's no mypyc callers so there's no real point in having these outside of librt_strings.

Copy link
Copy Markdown
Collaborator

@p-sawicki p-sawicki May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah there would be overhead if the functions are exposed through the C_API array. but i think we could avoid it by defining the C functions inline in librt_strings.h instead. that way when compiling with mypyc, the function would be part of the extension and called statically.

i think this would be better than exposing through C_API especially for the functions that are very simple and thus wouldn't bloat the extension object files. edit: and i think it would be fine to define all the functions you've added so far as inline.

- codepoint_extra_ops.h: include CPy.h and move the isidentifier slow
  path inline into LibRTStrings_IsIdentifier. Aborts via
  CPyError_OutOfMemory on allocation failure (the helper is ERR_NEVER,
  so returning a silently-wrong bool under memory pressure was the wrong
  contract). Matches the pattern in the sibling _extra_ops.h files (all
  bodies static inline, CPy.h included for runtime helpers).
- codepoint_extra_ops.c: reduce to a single-line shim that #includes the
  header. Exists only so SourceDep("codepoint_extra_ops.c") pulls the
  header into user mypyc extensions in include_runtime_files mode.
- build.py / lib-rt/setup.py: drop codepoint_extra_ops.c from the
  librt.strings module sources. The _extra_ops.c files are mypyc-internal
  (linked into user extensions via SourceDep in mypyc/ir/deps.py); the
  librt.strings Python module shouldn't depend on them, matching how
  bytes_extra_ops, str_extra_ops, etc. are organized. librt.strings now
  picks up LibRTStrings_IsIdentifier via #include of the header.
@github-actions

This comment has been minimized.

Per follow-up review on python#21522, the codepoint classifiers belong with
the rest of the librt.strings public surface rather than in a shared
inline header, since they implement Python-visible librt.strings
functions (not mypyc-internal codegen helpers like the other _extra_ops
files). Move them out of codepoint_extra_ops.h and into librt_strings.c
as proper non-static functions, exposed to mypyc-compiled callers via
the capsule API the same way every other LibRTStrings_* function works.

This keeps the librt module files independent of mypyc-internal
_extra_ops headers, matching the pattern used by BytesWriter_internal
etc. The cost is one indirect call per use vs. the previous inlined
macro; still substantially faster than the Python method dispatch the
primitives are replacing.

- librt_strings.h: bump LIBRT_STRINGS_API_VERSION 4->5, LIBRT_STRINGS_API_LEN
  14->19.
- librt_strings_api.h: add 5 macro entries for LibRTStrings_API[14..18].
- librt_strings.c: define the 5 helpers; register them in the capsule
  table; drop `#include "codepoint_extra_ops.h"`.
- mypyc/ir/deps.py: delete CODEPOINT_EXTRA_OPS.
- mypyc/primitives/librt_strings_ops.py: drop the CODEPOINT_EXTRA_OPS
  dep from the five codepoint primitives.
- Delete codepoint_extra_ops.{c,h}.
@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

Per the follow-up review on python#21522 (discussion_r3302940360), the
codepoint helpers should be defined `static inline` in librt_strings.h
rather than routed through the capsule API. The librt_strings.h header
is already pulled into both the librt.strings module and every mypyc-
compiled extension that uses any librt.strings primitive (via
librt_strings_api.h), so static inlines there are visible to both sides
with no per-call indirection -- which matters here because the helpers
participate in per-character hot loops where the capsule call overhead
would dwarf the work of a single Py_UNICODE_IS* macro.

This supersedes the capsule approach from the previous commit:

- librt_strings.h: include CPy.h and stdint.h; define the five
  classification helpers (IsSpace, IsDigit, IsAlnum, IsAlpha,
  IsIdentifier) as `static inline`; revert LIBRT_STRINGS_API_VERSION
  5 -> 4 and _API_LEN 19 -> 14.
- librt_strings_api.h: drop the five capsule macro entries.
- librt_strings.c: drop the five out-of-line function bodies and the
  matching capsule table entries; the `cp_*` Python wrappers now reach
  the inlines via the header include.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants